Audiovisual speech recognition using multiscale nonlinear image decomposition
Authors
Abstract
There has recently been increasing interest in enhancing speech recognition with visual information derived from the talker's face. This paper demonstrates the use of nonlinear image decomposition, in the form of a ‘sieve’, applied to the task of visual speech recognition. Information derived from the mouth region is used in visual and audiovisual speech recognition on a database of the letters A-Z for four talkers. A scale histogram is generated on a per-frame basis directly from the grayscale pixels of a window containing the talker's mouth. Results are presented for the visual-only, audio-only and a simple audiovisual case.
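As an illustration of the scale histogram idea, the following is a minimal sketch and not the authors' implementation. It assumes a 1D sieve built from a cascade of flat openings and closings of increasing size, applied to each pixel column of the mouth window, with the histogram counting the granules removed at each scale. The function names (`sieve_granules`, `scale_histogram`, `frame_scale_histogram`), the open-then-close filter choice, and the column-wise scan are all assumptions made for this example.

```python
def opening(x, k):
    """Flat 1D opening: removes maxima narrower than k samples."""
    n = len(x)
    # Minimum of every length-k window that lies fully inside the signal.
    mins = [min(x[j:j + k]) for j in range(n - k + 1)]
    # Each sample takes the largest window minimum among windows covering it.
    return [max(mins[max(0, i - k + 1):min(i, n - k) + 1]) for i in range(n)]


def closing(x, k):
    """Flat 1D closing (dual of opening): removes minima narrower than k."""
    return [-v for v in opening([-v for v in x], k)]


def sieve_granules(x, max_scale):
    """Cascade of open-then-close filters; returns one granule signal per scale."""
    if max_scale >= len(x):
        raise ValueError("max_scale must be smaller than the signal length")
    current = list(x)
    granules = []
    for scale in range(1, max_scale + 1):
        k = scale + 1  # a window of scale + 1 samples removes extrema of width <= scale
        filtered = closing(opening(current, k), k)
        granules.append([a - b for a, b in zip(current, filtered)])
        current = filtered
    return granules


def count_granules(d):
    """Count maximal runs of equal nonzero value in a granule signal."""
    count, prev = 0, 0
    for v in d:
        if v != 0 and v != prev:
            count += 1
        prev = v
    return count


def scale_histogram(x, max_scale):
    """Number of granules removed at each scale 1..max_scale."""
    return [count_granules(d) for d in sieve_granules(x, max_scale)]


def frame_scale_histogram(window, max_scale):
    """Sum the 1D scale histograms of every pixel column of one frame (assumed scan order)."""
    total = [0] * max_scale
    for column in zip(*window):
        for s, c in enumerate(scale_histogram(list(column), max_scale)):
            total[s] += c
    return total


if __name__ == "__main__":
    column = [0, 5, 0, 0, 3, 3, 0, 0, 0]
    print(scale_histogram(column, 3))
```

Because this variant counts granules rather than summing their amplitudes, multiplying the pixel values by a positive constant leaves the histogram unchanged. An amplitude-weighted variant would instead sum the absolute granule values at each scale.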
Similar resources
Scale Based Features for Audiovisual Speech Recognition
This paper demonstrates the use of nonlinear image decomposition, in the form of a sieve, applied to the task of audiovisual speech recognition of a database of the letters A–Z for ten talkers. A scale-based feature vector is formed on a per-frame basis directly from the grayscale pixels of an image containing the talker's mouth. This is independent of image amplitude and position information an...
A Multiscale Image Representation Using Hierarchical (BV, L2) Decompositions
We propose a new multiscale image decomposition which offers a hierarchical, adaptive representation for the different features in general images. The starting point is a variational decomposition of an image, $f = u_0 + v_0$, where $[u_0, v_0]$ is the minimizer of a J-functional, $J(f, \lambda_0; X, Y) = \inf_{u+v=f} \{ \|u\|_X + \lambda_0 \|v\|_Y^p \}$. Such minimizers are standard tools for image manipulations (e.g., denoising, ...
Lipreading Using Shape, Shading and Scale
This paper compares three methods of lipreading for visual and audio-visual speech recognition. Lip shape information is obtained using an Active Shape Model (ASM) lip tracker but is not as effective as modelling the combined shape and enclosed greylevel surface using an Active Appearance Model (AAM). A nontracked alternative is a nonlinear transform of the image using a multiscale spatial anal...
Modelling asynchrony in speech using elementary single-signal decomposition
Although the possibility of asynchrony between different components of the speech spectrum has been acknowledged, its potential effect on automatic speech recogniser performance has only recently been studied. This paper presents the results of continuous speech recognition experiments in which such asynchrony is accommodated using a variant of HMM decomposition. The paper begins with an invest...
Classification of emotional speech using spectral pattern features
Speech Emotion Recognition (SER) is a new and challenging research area with a wide range of applications in man-machine interactions. The aim of a SER system is to recognize human emotion by analyzing the acoustics of speech sound. In this study, we propose Spectral Pattern features (SPs) and Harmonic Energy features (HEs) for emotion recognition. These features extracted from the spectrogram ...